Navigating AI in Data Analysis: Suggestions and Cautions

Clay Ford

2025-11-06

Doing hard stuff

  • AI can be great for helping us do hard stuff we need to do, but that we either don’t want to do or that we’re not sure how to do right:

    • write a letter of recommendation
    • write a regular expression
    • translate R code to Python
  • In all examples, I can verify that an AI solution works.

  • For research purposes, statistics and data analysis can qualify as “hard stuff we need to do but we’re not sure how to do right.”

  • We can ask AI for help with stats, but how do we verify the answer is correct?

Prevailing AI advice for statistical analysis

Mostly safe AI tasks…

  • generating computer code for data wrangling
  • explaining computer code
  • debugging computer code

Potentially dangerous AI tasks…

  • planning a study (power and sample size)
  • choosing a statistical analysis
  • generating computer code for statistical analysis
  • interpreting analysis results

I will demonstrate some of the dangers today.

The chosen tools

  • I present results from Claude and Copilot
  • I have a paid subscription to Claude (courtesy of UVA Library)
  • UVA provides a secure and protected Copilot license (includes contractual data protection of university information)
  • I don’t endorse these tools or claim they are better than others available
  • I encourage you to try the following examples with other AI tools

Statistical planning of a study

The two most frequently asked questions:

  • “How many observations do I need?”
  • “I’ll have X observations. What’s the power of my study?”

Planning example using Claude

Claude gives wrong answer

Right answer using R

library(pwr)
pwr.anova.test(k = 3, f = 0.25, sig.level = 0.01, power = 0.9)

     Balanced one-way analysis of variance power calculation 

              k = 3
              n = 94.48714
              f = 0.25
      sig.level = 0.01
          power = 0.9

NOTE: n is number in each group

Notice this sample size is more than 4 times larger than what Claude tells us.

Asking Claude for R code

Claude gives correct R code

# Parameters
k <- 3          # number of groups (fuel types)
alpha <- 0.01   # significance level
power <- 0.90   # desired power
f <- 0.25       # Cohen's f for medium effect size

# Calculate sample size per group
result1 <- pwr.anova.test(k = k, f = f, sig.level = alpha, power = power)
result1

     Balanced one-way analysis of variance power calculation 

              k = 3
              n = 94.48714
              f = 0.25
      sig.level = 0.01
          power = 0.9

NOTE: n is number in each group

But promises the wrong answer

The code confirms you need 95 trials per fuel type, not 22.

Planning example using Claude

Again, Claude gives the wrong answer. The power it reports is too high.

Right answer using R

library(pwr)
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05, 
           type = "two.sample", alternative = "two.sided")

     Two-sample t test power calculation 

              n = 30
              d = 0.5
      sig.level = 0.05
          power = 0.4778965
    alternative = two.sided

NOTE: n is number in *each* group

Power is 0.48. This study is much less powerful than Claude states.

Claude gets more wrong

The last part of its answer:

This is also wrong. For 80% power we need 64 students per group, not 32.

Right answer using R

pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8,
           type = "two.sample", alternative = "two.sided")

     Two-sample t test power calculation 

              n = 63.76561
              d = 0.5
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

Following up with Claude

No, it used the wrong sample size per group (60 instead of 30).

Planning example using Claude

This is wrong. Much too big by a factor of 5.

Right answer using R

library(pmsampsize)
pmsampsize(type = "b", csrsquared = 0.288, parameters = 24, prevalence = 0.17)
NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared 
NB: Assuming 0.05 margin of error in estimation of intercept 
NB: Events per Predictor Parameter (EPP) assumes prevalence = 0.17  
 
           Samp_size Shrinkage Parameter CS_Rsq Max_Rsq Nag_Rsq  EPP
Criteria 1       623     0.900        24  0.288   0.598   0.481 4.41
Criteria 2       667     0.906        24  0.288   0.598   0.481 4.72
Criteria 3       217     0.906        24  0.288   0.598   0.481 1.54
Final            667     0.906        24  0.288   0.598   0.481 4.72
 
 Minimum sample size required for new model development based on user inputs = 667, 
 with 114 events (assuming an outcome prevalence = 0.17) and an EPP = 4.72 
 
 

Minimum sample size is about 667.

Asking Claude for R code

Claude gives the right code (almost). The rsquared argument should be csrsquared.

Claude means well

After providing mostly correct R code, Claude says it “should give you n = 3,661 as the minimum sample size.” That number is wrong, but everything else is correct.

Planning example using Copilot

This is correct!

Verify Copilot calculation

power.t.test(delta = 0.75, sd = 2.25, sig.level = 0.01,
             power = 0.9, type = "two.sample", alternative = "two.sided")

     Two-sample t test power calculation 

              n = 269.4929
          delta = 0.75
             sd = 2.25
      sig.level = 0.01
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

Planning example using Copilot

This is very wrong. That’s 4 times too high.

Right answer using R

pwr.2p.test(h = ES.h(p1 = 0.3, p2 = 0.4),
            sig.level = 0.05, 
            power = 0.90, 
            alternative = "two.sided")

     Difference of proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.2101589
              n = 475.8065
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: same sample sizes

Asking Copilot for R code

This is the correct code.

Planning example using Copilot

This is wrong.

Right answer using R

An odds ratio of 3 implies P(diabetes|X = 1) = 0.25.
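This implication can be checked with basic arithmetic. A minimal sketch converting the baseline probability (p0 = 0.1, from the example) and the odds ratio:

```r
# Convert baseline probability and odds ratio to the implied probability
p0 <- 0.1
odds0 <- p0 / (1 - p0)       # baseline odds = 1/9
odds1 <- 3 * odds0           # an odds ratio of 3 triples the odds
p1 <- odds1 / (1 + odds1)    # convert back to a probability
p1                           # 0.25
```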

library(WebPower)
wp.logistic(n = NULL, p0 = 0.1, p1 = 0.25,
            alpha = 0.05, power = 0.8,
            alternative = "two.sided",
            family = "Bernoulli",
            parameter = 0.3)
Power for logistic regression

     p0   p1     beta0    beta1        n alpha power
    0.1 0.25 -2.197225 1.098612 218.8331  0.05   0.8

URL: http://psychstat.org/logistic

We need to sample about 220 subjects.

Asking Copilot for R code

This is also wrong: the right package, but a nonexistent function.

Corrected R code

library(powerMediation)
SSizeLogisticBin(p1 = 0.1, p2 = 0.25, B = 0.3, 
                 alpha = 0.05, power = 0.8)
[1] 223

I had to read the documentation of the powerMediation package, find the correct function (SSizeLogisticBin()), and then figure out how to use the function.

Copilot wants to help

After providing a wrong answer and wrong R code, it asks the following:

If it couldn’t handle a simple logistic regression study plan, why would I trust it with a more complex plan?

Let’s take it up on its offer.

Study plan with covariates

It proceeded to supply sound advice.

Copilot R code for study plan with covariates

The powerLogisticBin() function does exist, but it does not allow adjustment for covariates. The example also still uses the nonexistent ssize.logistic() function.

Study planning with AI

Friendly advice:

  • Request code
  • Verify code works
  • Verify code is correct, perhaps consulting with an expert, textbook, or journal article
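One way to verify a power calculation without an expert on hand is a quick simulation. As a sketch, this approximates the power of the two-sample t test from the earlier example (d = 0.5, 64 students per group, alpha = 0.05), which should come out near 0.80:

```r
# Simulate many two-sample t tests with a standardized effect size of 0.5
# and n = 64 per group; the proportion of p-values below alpha estimates power
set.seed(1)
pvals <- replicate(5000, {
  g1 <- rnorm(64, mean = 0,   sd = 1)
  g2 <- rnorm(64, mean = 0.5, sd = 1)   # d = 0.5 in standardized units
  t.test(g1, g2)$p.value
})
mean(pvals < 0.05)   # should be close to 0.80
```

If the simulated power disagrees badly with the code the AI gave you, something is wrong with one of them.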

Choosing a statistical analysis

Depends on…

  • The research question
  • Subject-matter expertise
  • The outcome(s) of interest and how they’re measured
  • The amount of data available
  • The researcher’s comfort level with statistics

Analysis example

In the 1970s, the US Commission on Civil Rights examined charges by Chicago community organizations that insurance companies were redlining their neighborhoods.

To what extent does racial composition of a community affect underwriting practices after controlling for factors that legitimately affect underwriting such as theft and fire damage?

United States Commission on Civil Rights 1979 report:
Insurance Redlining: Fact Not Fiction

Analysis without AI

Julian Faraway reanalyzes this data in his book Linear Modeling with R (Ch 13).

A slightly modified version of this analysis is available at the following link:

https://static.lib.virginia.edu/statlab/materials/redlining_analysis.html

Letting AI guide the analysis

Copilot offers to analyze data for you.

Let’s give it a try!

Letting Copilot guide the analysis

When I let Copilot analyze the insurance redlining data:

  • visualized zip code as if it were a number
  • didn’t suggest log-transforming income
  • suggested logistic regression and decision trees as possibilities
  • offered to fit interactions (the sample size is too small)
  • miscalculated VIFs
  • offered to drop variables with high VIFs, which was unnecessary
  • offered to do stepwise regression, which is bad statistical practice
  • did not preserve the model after stepwise regression
  • dropped almost 10% of the data because it was moderately influential
  • identified fire as the “strong” predictor when it’s simply a covariate

Letting Copilot assist with the analysis

Instead of letting Copilot guide the analysis, ask it to help with specific tasks:

  • help automate the creation of plots
  • help with code to test all 16 combinations of covariates
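As a sketch of the second task (variable names are assumptions based on the chredlin data in the faraway package; adjust them to your own data), the 16 covariate combinations can be generated and fit in a loop:

```r
# Fit involact ~ race plus every subset of the four covariates (16 models).
# Variable names follow the chredlin data set in the faraway package.
library(faraway)
covars <- c("fire", "theft", "age", "log(income)")
subsets <- unlist(lapply(0:4, function(k)
  combn(covars, k, simplify = FALSE)), recursive = FALSE)
fits <- lapply(subsets, function(s) {
  f <- reformulate(c("race", s), response = "involact")
  lm(f, data = chredlin)
})
length(fits)   # 16 models
```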

Statistical analysis with AI

Friendly advice:

  • Give AI small, explicit tasks
  • Request code, verify it works
  • If requesting code for running analysis, be sure it’s the correct analysis
  • Disable or discourage suggestions for further analysis, or exercise healthy skepticism when it makes suggestions

General AI advice:

  • Provide context for the analysis
  • Run multiple chats to check consistency
  • Ask for sources and then check the sources (sometimes the summaries it gives don’t match what’s actually in the source that it cites)
  • Acknowledge the use of AI in your final paper, report or presentation

Final thought

“…we strongly warn about the non-skeptical use of LLMs. Relying naively on the correctness of their output is irresponsible, and statistical studies can be seriously corrupted. Consequently, expert knowledge from biostatisticians remains indispensable, along with maintaining a questioning stance towards AI outputs.”

Dobler, et al. (2025). “ChatGPT as a Tool for Biostatisticians: A Tutorial on Applications, Opportunities, and Limitations,” Statistics in Medicine.

Workshop complete

For statistics help, contact UVA Library StatLab:

Thank you to Jenn Huck, Hyeseon Seo, Lauren Brideau, and Ethan Kadiyala for suggestions that improved this presentation!










This work is licensed under a Creative Commons Attribution 4.0 International License.
